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DISCUSSION PAPER 
EQUI-ENERGY SAMPLER WITH APPLICATIONS IN 
STATISTICAL INFERENCE AND STATISTICAL MECHANICS 1 ' 2 ' 3 

By S. C. Kou, Qing Zhou and Wing Hung Wong 

Harvard University, Harvard University and Stanford University 

We introduce a new sampling algorithm, the equi-energy sampler, 
for efficient statistical sampling and estimation. Complementary to 
the widely used temperature-domain methods, the equi-energy sam- 
pler, utilizing the temperature-energy duality, targets the energy di- 
rectly. The focus on the energy function not only facilitates efficient 
sampling, but also provides a powerful means for statistical estima- 
tion, for example, the calculation of the density of states and micro- 
canonical averages in statistical mechanics. The equi-energy sampler 
is applied to a variety of problems, including exponential regression 
in statistics, motif sampling in computational biology and protein 
folding in biophysics. 

1. Introduction. Since the arrival of modern computers during World 
War II, the Monte Carlo method has greatly expanded the scientific horizon 
to study complicated systems ranging from the early development in compu- 
tational physics to modern biology. At the heart of the Monte Carlo method 
lies the difficult problem of sampling and estimation: Given a target distri- 
bution, usually multidimensional and multimodal, how do we draw samples 
from it and estimate the statistical quantities of interest? In this article, we 
attempt to introduce a new sampling algorithm, the equi-energy sampler, to 
address the problem. Since the Monte Carlo method began from calculations 
in statistical physics and mechanics, to introduce the equi-energy sampler, 
we begin from statistical mechanics. 
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The starting point of a statistical mechanical computation is the energy 
function or Hamiltonian h(x). According to Boltzmann and Gibbs, the dis- 
tribution of a system in thermal equilibrium at temperature T is described 
by the Boltzmann distribution, 

(!) Pt(x) = -^=r exp(-h(x)/T), 

where Z(T) = ^ x exp(— /i(x)/T) is referred to as the partition function. For 
any state function g(x), its expectation /U 9 (T) with respect to the Boltzmann 
distribution is called its Boltzmann average, also known as the thermal av- 
erage in the physics literature, 

(2) Ms (T) = $>(x)ex P (-%)/T)/Z(T). 

X 

To study the system, in many cases we are interested in using Monte Carlo 
simulation to obtain estimates of Boltzmann averages as functions of tem- 
perature for various state functions. In addition to Boltzmann averages, 
estimating the partition function Z(T), which represents the dependency of 
the normalization factor in (1) as a function of temperature, is also of sig- 
nificant interest, as it is well known that many important thermodynamic 
quantities, such as free energy, specific heat, internal energy and so on, can 
be computed directly from the partition function (see Section 4). The fun- 
damental algorithm for computing Boltzmann averages is due to Metropolis 
et al. [29], who proposed the use of a reversible Markov chain constructed in 
such a way so as to guarantee that its stationary distribution is the Boltz- 
mann distribution (1). Later, this algorithm was generalized by Hastings [11] 
to allow the use of an asymmetric transition kernel. Given a current state x, 
this Metropolis-Hastings algorithm generates a new state by either reusing 
the current state x or moving to a new state y drawn from a proposal ker- 
nel K(x — > y). The proposal state y is accepted with probability min(l, MR) 
where MR is the Metropolis-Hastings ratio pT{y)K{y — > x)/pj-(x)K(x — > y). 
The algorithm in this way generates a Markov chain Xi, i = 1, . . . , n. Under 
ergodic conditions [39], the time average ra _1 ^"=i9(^) provides a consis- 
tent estimate of the Boltzmann average (2). 

The Metropolis algorithm, however, can perform poorly if the energy 
function has many local minima separated by high barriers that cannot be 
crossed by the proposal moves. In this situation the chain will be trapped 
in local energy wells and will fail to sample the Boltzmann distribution cor- 
rectly. To overcome this problem one can design specific moves that have a 
higher chance to cut across the energy barrier (e.g., the conditional sampling 
moves in Gibbs sampling) or to add auxiliary variables so that the energy 
wells become connected by the added dimension (e.g., the group Ising up- 
dating of Swendsen and Wang [37], or the data-augmentation technique of 
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Tanner and Wong [38]). However, these remedies are problem-specific and 
may or may not work for any given problem. A breakthrough occurred with 
the development of the parallel tempering algorithm by Geyer [8] (also called 
exchanged Monte Carlo; Hukushima and Nemoto [13]). The idea is to per- 
form parallel Metropolis sampling at different temperatures. Occasionally 
one proposes to exchange the states of two neighboring chains (i.e., chains 
with adjacent temperature levels). The acceptance probability for the ex- 
change is designed to ensure that the joint states of all the parallel chains 
evolve according to the Metropolis-Hastings rule with the product distri- 
bution (i.e., the product of the Boltzmann distributions at the different 
temperatures) as the target distribution. Geyer's initial objective was to use 
the (hopefully) faster mixing of the high-temperature chains to drive the 
mixing of the whole system, thereby to achieve faster mixing at the low- 
temperature chain as well. It is clear that parallel tempering can provide 
estimates of Boltzmann averages at the temperatures used in the simu- 
lation. Marinari and Parisi [27] developed simulated tempering that uses 
just a single chain but augments the state by a temperature variable that 
is dynamically moved up or down the temperature ladder. These authors 
further developed the theory of using the samples in the multiple tempera- 
tures to construct estimates of the partition function, and to investigate 
phase transition. In the meantime, Geyer [9] also proposed a maximum 
likelihood approach to estimate the ratios of normalization constants and 
hence obtain information on the partition function at the selected temper- 
atures. 

In contrast to the statistical mechanical computations, the starting point 
in statistical inference is usually one given distribution, for example, the 
distribution on a high-dimensional parameter space. If we take the energy 
to be the negative log-density function in this case, we are then interested 
in obtaining the Boltzmann average only at T = 1. The Metropolis and 
related algorithms have been developed and applied to solve many statistical 
computation problems, and have greatly enhanced our ability to analyze 
problems ranging from image analysis to missing value problems to biological 
sequence analysis to single-molecule chemistry [6, 7, 17, 20, 22, 38]. However, 
as Geyer, Marinari, Parisi and others have pointed out, even if the immediate 
interest is at T = 1, simulation at temperatures other than T = 1 is often 
necessary in order to achieve efficient sampling. Furthermore, computing 
the normalization constants (i.e., partition function) is also important in 
statistical tasks such as the determination of likelihood ratios and Bayes 
factors [9, 10, 28]. 

It is thus seen that historically dynamic Monte Carlo methods were de- 
veloped to simulate from the Boltzmann distribution at fixed temperatures. 
These methods aim to provide direct estimates of parameters such as Boltz- 
mann averages and partition functions, which are functions of temperature. 
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We hence refer to these methods as temperature- domain methods. The pur- 
pose of this article is to develop an alternative sampling and estimation ap- 
proach based on energy- domain considerations. We will construct algorithms 
for the direct estimation of parameters such as microcanonical averages and 
density of states (see Section 2) that are functions of energy. We will see 
in Section 2 that there is a duality between temperature-domain functions 
and energy-domain functions, so that once we have obtained estimates of 
the density of states and microcanonical averages (both are energy-domain 
functions), we can easily transfer to the temperature domain to obtain the 
partition function and the Boltzmann averages. In Section 3 we introduce 
the equi-energy sampler (EE sampler), which, targeting energy directly, is 
a new Monte Carlo algorithm for the efficient sampling from multiple en- 
ergy intervals. In Sections 4 and 5 we explain how to use these samples 
to obtain estimates of density of states and microcanonical averages, and 
how to extend the energy-domain method to estimate statistical quantities 
in general. In Section 6 we illustrate the wide applicability of this method 
by applying the equi-energy sampler and the estimation methods to a vari- 
ety of problems, including an exponential regression problem, the analysis 
of regulatory DNA motifs and the study of a simplified model for protein 
folding. Section 7 concludes the article with discussion and further remarks. 

2. Energy-temperature duality. The Boltzmann law (1) implies that the 
conditional distribution of the system given its energy h(x) = u is the uni- 
form distribution on the equi-energy surface {x:h(x) =u}. In statistical 
mechanics, this conditional distribution is referred to as the microcanoni- 
cal distribution given energy u. Accordingly, the conditional expectation of 
a state function g{x) given an energy level u is called its microcanonical 
average: 



Note that (3) is independent of the temperature T used in the Boltzmann 
distribution for X. Suppose that the infinitesimal volume of the energy slice 
{x : h(x) £ (u,u + du)} is approximately equal to Q(u) du. This function Q(u) 
is then called the density of states function. If the state space is discrete, 
then we replace the volume by counts, in which case fi(ii) is simply the 
number of states with the energy equal to u. Without loss of generality, 
we assume that the minimum energy of the system ii m i n = 0. The following 
result follows easily from these definitions. 

Lemma 1. Let (3 = 1/T denote the inverse temperature so that the Boltz- 
mann averages and partition function are indexed by (3 as well as by T; then 



(3) 



u g (u) = E(g(X)\h(X)=u). 
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In particular, the partition function Z(/3 1 ) and the density of states Q(u) 
form a Laplace transform pair. 

This lemma suggests that the Boltzmann averages and the partition func- 
tion can be obtained through Monte Carlo algorithms designed to compute 
the density of states and microcanonical averages. We hence refer to such 
algorithms as energy-domain algorithms. 

The earliest energy-domain Monte Carlo algorithm is the multicanoni- 
cal algorithm due to Berg and Neuhaus [2], which aims to sample from a 
distribution flat in the energy domain through an iterative estimation up- 
dating scheme. Later, the idea of iteratively updating the target distribution 
was generalized to histogram methods (see [18, 40] for a review). The main 
purpose of these algorithms is to obtain the density of states and related 
functions such as the specific heat. They do not directly address the estima- 
tion of Boltzmann averages. 

In this article we present a different method that combines the use of 
multiple energy ranges, multiple temperatures and step sizes, to produce 
an efficient sampling scheme capable of providing direct estimates of all 
microcanonical averages as well as the density of states. We do not use 
iterative estimation of density of states as in the multicanonical approach; 
instead, the key of our algorithm is a new type of move called the equi-energy 
jump that aims to move directly between states with similar energy (see the 
next section). The relationship between the multicanonical algorithm and 
the equi-energy sampler will be discussed further in Section 7. 

3. The equi-energy sampler. 

3.1. The algorithm. In Monte Carlo statistical inference one crucial task 
is to obtain samples from a given distribution, often known up to a nor- 
malizing constant. Let ir(x) denote the target distribution and let h(x) be 
the associated energy function. Then ir(x) oc exp(— h(x)). For simple prob- 
lems, the famous Metropolis-Hastings (MH) algorithm, which employs a 
local Markov chain move, could work. However, if tt(x) is multimodal and 
the modes are far away from each other, which is often the case for practical 
multidimensional distributions, algorithms relying on local moves such as 
the MH algorithm or the Gibbs sampler can be easily trapped in a local 
mode indefinitely, resulting in inefficient and even unreliable samples. 

The EE sampler aims to overcome this difficulty by working on the energy 
function directly. First, a sequence of energy levels is introduced: 

(4) H <H 1 <H 2 <---<H K < H K+1 = oo, 

such that Hq is below the minimum energy, Hq < inf x h(x). Associated with 
the energy levels is a sequence of temperatures 

1 = T <T 1 <---<T K . 
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Fig. 1. Illustration of the equi- energy jump, where the sampler can jump freely between 
the states with similar energy levels. 

The EE sampler considers K + 1 distributions, each indexed by a tempera- 
ture and an energy truncation. The energy function of the ith distribution 
7Tj (0 < i < K) is hi(x) = jr(h(x) V Hi), that is, iti{x) oc exp(— hi(x)). For 
each i, a sampling chain targeting 7Tj is constructed. Clearly ttq is the ini- 
tial distribution of interest. The EE sampler employs the other K chains to 
overcome local trapping, because for large i the energy truncation and the 
high temperature on hi{x) flatten the distribution 7Tj(x), making it easier 
to move between local modes. The quick mixing of chains with large i is 
utilized by the EE sampler, through a step termed the equi-energy jump, to 
help sampling from 7Tj with small i, where the landscape is more rugged. 

The equi-energy jump, illustrated in Figure 1, aims to directly move 
between states with similar energy levels. Intuitively, if it can be imple- 
mented properly in a sampling algorithm, it will effectively eliminate the 
problem of local trap. The fact that the ith energy function hi(x) is mono- 
tone in h(x) implies that, above the truncation values, the equi-energy sets 
Si(H) = {x : hi(x) = H} are mutually consistent across i. Thus once we have 
constructed an empirical version of the equi-energy sets at high-order 7Tj 
(i.e., 7Tj with large i), these high-order empirical sets will remain valid at 
low-order 7Tj. Therefore, after the empirical equi-energy sets are first con- 
structed at high order 7Tj, the local trapping at low-order 7Tj can be largely 
evaded by performing an equi-energy jump that allows the current state 
to jump to another state drawn from the already constructed high-order 
empirical equi-energy set that has energy level close to the current state. 
This is the basic idea behind the EE sampler. We will refer to the empirical 
equi-energy sets as energy rings hereafter. 

To construct the energy rings, the state space X is partitioned according 

to the energy levels, X = Uj=o D j> where D j = { x : K x ) G [ H ji #i+i)}> < 
j < K, are the energy sets, determined by the energy sequence (4). For any 
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x 6 X , let I{x) denote the partition index such that I{x) = j, if x £ Dj, that 
is, if h{x) G [Hj, Hj + i). 

The EE sampler begins from an MH chain X^ K ^ targeting the highest- 
order distribution ttk- After an initial burn-in period, the EE sampler starts 
constructing the .ffth-order energy rings Dj by grouping the samples 

according to their energy levels; that is, D - consists of all the samples 

X { n K) such that I(xi K) ) = j. After the chain xW has been running for N 
steps, the EE sampler starts the second highest-order chain X^ K ~^ target- 
ing 7Tk-i, while it keeps on running and updating L>f ] (0<j<K). 
The chain X( K ~^ is updated through two operations: the local move and 
the equi-energy jump. At each update a coin is flipped; with probability 
1 — Pee the current state Xn ^ undergoes an MH local move to give the 
next state X^_ 1 1 \ and with probability p ee , Xn ^ goes through an equi- 
energy jump. In the equi-energy jump, a state y is chosen uniformly from the 
highest-order energy ring Dj K ^ indexed by j = I(Xn K ^) that corresponds 
to the energy level of X^ K ^ [note that y and X^ K ^ have similar energy 
level, since I(y) = I(X n K ^) by construction]; the chosen y is accepted to 
be the next state X^_ 1 ^ with probability min(l, nK - 1 ^J^^ n 1^- if y [ s 

not accepted, X^_ 1 ^ keeps the old value X n K . After a burn-in period 
on X^ K ~~ l \ the EE sampler starts the construction of the second highest- 
order [i.e., (K — l)st order] energy rings Dj ^ in much the same way as 
the construction of Dj\ that is, collecting the samples according to their 
energy levels. Once the chain X^ K ~ 1 ^ has been running for N steps, the EE 
sampler starts X^ K ~ 2 ^ targeting ttk-2 while it keeps on running 
and JfW. Like A^" 1 ), the chain X^ K ~ 2 ^ is updated by the local MH move 
and the equi-energy jump with probabilities 1 — p cc and p ce , respectively. In 

i{x}*-"y 



the equi-energy jump, a state y uniformly chosen from D t ,^k-2) j where 



Xn is the current state, is accepted to be the next state A^ +1 with 
probability min(l, nK y J^-i) " )■ The EE sampler thus successively 

moves down the energy and temperature ladder until the distribution ttq is 
reached. Each chain X®, < i < K, is updated by the equi-energy jump 
and the local MH move; the equi-energy move proposes a state y uniformly 
chosen from the energy ring D^ i+1 )^ and accepts the proposal with prob- 

ability min(l, EiM^tli^n h _^_t each chain X^ % \ the energy rings L)f^ are 
constructed after a burn-in period, and will be used for chain X^ l ~ 1 ^ in the 
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equi-energy jump. The basic sampling scheme can be summarized as follows. 



let 



The algorithm of the EE sampler 

(i) " (i) 

Assign Xq an initial value and set D-—0 for all i and j 

For n = 1,2, . .. 

For i = K downto 

Mn>{K-i){B + N) do 

(* B is the burn- in period, and N is the 

period to construct initial energy rings *) 

]£i = K or if = do 

'(41) 

(i) 

perform an MH step on X^—i with 

(i) 

target distribution 7Tj to obtain X„ 

else 

with probability 1 — p ee , perform an MH step 

on X^-i targeting 7Tj to obtain X$ 
with probability p ee , uniformly pick a state y from D^ t+1 ™ and 



A"A ; <- y with prob min i, (<) — ; 

-^i-i with the remaining prob. 

endif 

if t)(B + J\r) + -B do 

(* add the sample to the energy rings after the burn-in period 



endif 
endif 
endfor 
endfor 



The idea of moving along the equi-energy surface bears a resemblance to 
the auxiliary variable approach, where to sample from a target distribution 
tt(x) , one can iteratively first sample an auxiliary variable U ~ Unif [0, tt(X)\ , 
and then sample X ~ Unifjx :ir(x) > U}. This approach has been used by 
Edwards and Sokal [5] to explain the Swendsen-Wang [37] clustering al- 
gorithm for Ising model simulation, and by Besag and Green [3] and Hig- 
don [12] on spatial Bayesian computation and image analysis. Roberts and 
Rosenthal [32], Mira, Moller and Roberts [30] and Neal [31] provide fur- 
ther discussion under the name of slice sampling. Under our setting, this 
auxiliary variable (slice sampler) approach amounts to sampling from the 
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lower-energy sets. In comparison, the EE sampler's focus on the equi-energy 
set is motivated directly from the concept of micro canonical distribution in 
statistical mechanics. More importantly, the equi-energy jump step in the 
EE sampler offers a practical means to carry out the idea of moving along 
the energy sets (or moving horizontally along the density contours) , and thus 
provides an effective way to study not only systems in statistical mechanics, 
but also general statistical inference problems (see Sections 4, 5 and 6). 



3.2. The steady-state distribution. With help from the equi-energy jump 
to address the local trapping, the EE sampler aims to efficiently draw sam- 
ples from the given distribution tt (which is identical to ttq). A natural 
question is then: In the long run, will the EE samples follow the correct 
distribution? 

The following theorem shows that the steady-state distribution of chain 
X^> is indeed ivf, in particular, the steady-state distribution of X^ is ttq = 

TT. 



Theorem 2. Suppose (i) the highest-order chain X( K ' is irreducible 
and aperiodic, for i = 0,1, . . . , K — 1, the MH transition kernel Tj^jj of 
connects adjacent energy sets in the sense that for any j there exist sets 
A\ C Dj, A2 C Dj, B\ C Dj^i and B 2 C Dj + \ with positive measure such 
that the transition probabilities 

T$ u (A 1 ,B 1 )>0, T$ u (A 2 ,B 2 )>0 

and (iii) the energy set probabilities pp = Pn^X & Dj) > for all i and j. 
Then X^ is ergodic with TTi as its steady-state distribution. 



Proof. We use backward induction to prove the theorem. 

For i = K, X^ K ' simply follows the standard MH scheme. The desired 
conclusion thus follows from the fact that X^ K ^ is aperiodic and irreducible. 

Now assume the conclusion holds for the (i + l)st order chain, that is, as- 
sume is ergodic with steady-state distribution vrj + i. We want to show 
that the conclusion also holds for X® . According to the construction of the 

(i) 

EE sampler, if at the nth step X n = x, then at the next step with proba- 
bility 1 — Pee will be drawn from the transition kernel T$ n (x,-), and 

(i) * 

with probability p ce ^n+i wm ^ e ecma l to a y from with probability 

{ n+1 ~ V) ~\b^\ \ 'n(x)n i+1 (y))' yeDl ^ ■ 

I I(x) I 
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Therefore for any measurable set A, the conditional probability 

p(x { :leA\x^ = x ,x^) 

= (l-p cc )T$ n (x,A) 

7ri(y)7r i+1 (x 



+ p, 



1 



I(y£ A) mm 1, 



+ Pe 



\b [i+1) 



Y minll, 



7ri(x)vr i+ i(y) 



■Ki(x)ir i+ i(y) 



I(x€A). 



Using the induction assumption of the ergodicity of X^ %+1 ^ and also the fact 
that the lower-order chain X® does not affect the higher-order chain X^ l+1 ^ , 
we have, as n — > oo, 



P(X« 1 e A\X® 



J P(X^ +1 G A\X® = x,X^)dP(X^\X^ 

P{x® e A\x® =x,x^)dP(x^) 



x) 



(5) 



(l- PeB )T^{x,A) 
1 



Pc 



Pl{x) 



1 



7r i+ i(y)min 1 



1 



Similarly, as n — > oo , the difference 



7Tj + i(y)min 1 



TTi(y)TT i+ i(x) 
TTi(x)lli + l{y) 

7ri(y)7r i+1 {x) 
7ri(x)vr i+ i(y) 



dy 



x (i) x (i)^ 



I(x€A). 



0. 



Now let us define a new transition kernel 5^'(x, •), which undergoes the 



transition T, 



MH 



'cc,-) with probability 1 — p ce , and with probability p CE under- 
goes an MH transition with the proposal density q(x,y) = iii + i{y)I(y S 

Dn x \), that is, 7Tj + i(y) confined to the energy set D^ x y We then note 
that the right-hand side of (5) corresponds exactly to the transition kernel 
S^\x, •). Therefore, under the induction assumption, X^ is asymptotically 
equivalent to a Markovian sequence governed by S^(x,-). 

Since the kernel T^^x,-) connects adjacent energy sets and the pro- 
posal q(x, y) connects points in the same equi-energy set, it follows from 
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Chapman-Kolmogorov and < p cc < 1 that S^(x, •) is irreducible. S^'(x, ■) 
is also aperiodic because the proposal q(x, y) has positive probability to leave 
the configuration x staying the same. 

Since S^'(x, •) keeps 7Tj as the steady-state distribution, it finally follows 
from the standard Markov chain convergence theorem and the asymptotic 
equivalence (5) that is ergodic with 7Tj as its steady-state distribution. 
The proof is thus terminated. □ 

Remark 3. The assumption (ii) is weaker than assuming that is 
irreducible for i = 0,1, . . . , K — 1, because we can see that essentially the 
function of the MH local move is to bridge adjacent energy sets, while the 
equi-energy jump allows jumps within an equi-energy set. 

3.3. Practical implementation. There are some flexibilities in the practi- 
cal implementation of the EE sampler. We provide some suggestions based 
on our own experience. 

1. The choice of the temperature and energy ladder. 

Given the lowest and second highest energy levels Hq and Hk, we 
found that setting the other energy levels by a geometric progression, or 
equivalently setting log(-ffj + i — Hi) to be evenly spaced, often works quite 
well. The temperature could be chosen such that (-Hj+i — Hi)/Ti c, and 
we found that c € [1,5] often works well. 

2. The choice of K, the number of temperature and energy levels. 

The choice of K depends on the complexity of the problem. More 
chains and energy levels are usually needed if the target distribution is 
high-dimensional and multimodal. In our experience K could be roughly 
proportional to the dimensionality of the target distribution. 

3. The equi-energy jump probability p ee . 

In our experience taking p cc S [5%, 30%] often works quite well. See 
Section 3.4 for more discussion. 

4. Self-adaptation of the MH-proposal step size. 

As the order i increases, the distribution 7Tj becomes more and more 
flat. Intuitively, to efficiently explore a flat distribution, one should use a 
large step size in the MH proposal, whereas for a rough distribution, the 
step size has to be small. Therefore, in the EE sampler each chain 
should have its own step size in the local MH exploration. In practice, 
however, it is often difficult to choose the right step sizes in the very 
beginning. One can hence let the sampler tune by itself the step sizes. For 
each chain, the sampler can from time to time monitor the acceptance 
rate in the MH local move, and increase (decrease) the step size by a 
fixed factor, if the acceptance rate is too high (low). Note that in this 
self-adaptation the energy-ring structure remains unchanged. 
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5. Adjusting the energy and temperature ladder. 

In many problems, finding a close lower bound Hq for the energy func- 
tion h(x) is not very difficult. But in some cases, especially when h{x) is 
difficult to optimize, one might find during the sampling that the energy 
value at some state is actually smaller than the pre-assumed lower bound 
Hq. If this happens, we need to adjust the energy ladder and the temper- 
atures, because otherwise the energy sets Dj would not have the proper 
sizes to cover the state space, which could affect the sampling efficiency. 
The adjustment can be done by dynamically monitoring the sampler. 
Suppose after the ith chain X^ is started, but before the (i — l)st chain 
gets started, we find that the lowest energy value H m \ n reached so far is 
smaller than Hq. Then we first reset Hq = H m i n — (3, where the constant 
(3 > 0, say (5 = 2. Next given Hi and the new Hq we reset the in-between 
energy levels by a geometric progression, and if necessary add in more 
energy levels between Hq and Hi (thus adding more chains) so that the 
sequence Hj + \ — Hj is still monotone increasing in j. The temperatures 
between Tq = 1 and Tj are reset by (Hj + i — Hj)/Tj ps c. With the energy 
ladder adjusted, the samples are regrouped to new energy rings. Note 
that since the chains X^ K \X^ K ~ l \ . . . have already started, we do 
not change the values of Hk, ■ ■ ■ ,Hi and Tr, . . . , Tj, so that the target 
distributions ttk, ■ ■ ■ are not altered. 

3.4. A multimodal illustration. As an illustration, we consider sampling 
from a two-dimensional normal mixture model taken from [23], 

20 f 1 1 

(6) /( x ) = ^_^ eX p|___( x -/i i )'(x-/i i )j, 

where a\ = ■ ■ ■ = 020 = 0.1, it?i = • • • = W20 = 0.05, and the 20 mean vectors 

, ,_/2.18 8.67 4.24 8.41 3.93 3.25 1.70 

U*i»A*2»---> - y 576 . 959 » g 48 > L6g , g. 8 2 ' 3.47 ' 0.50 ' 

4.59 6.91 6.87 5.41 2.70 4.98 1.14 
5.60 ' 5.81 ' 5.40 ' 2.65 ' 7.88 ' 3.70 ' 2.39 ' 

8.33 4.93 1.83 2.26 5.54 1.69 
9.50' 1.50' 0.09' 0.31' 6.86' 8.11 

Since most local modes are more than 15 standard deviations away from the 
nearest ones [see Figure 2(a)], this mixture distribution poses a serious chal- 
lenge for sampling algorithms, and thus serves as a good test. We applied the 
EE sampler to this problem. Since the minimum value of the energy function 
h(x) = -log(/(x)) is around -log^) = 0.228, we took Hq = 0.2. K was 
set to 4, so only five chains were employed. The energy levels H\, . . . ,H^ 
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were set by a geometric progression in the interval [0,200]. The settings for 
energy levels and temperature ladders are tabulated in Table 1. The equi- 
energy jump probability p ee was taken to be 0.1. The initial states of the 
chains X® were drawn uniformly from [0, l] 2 , a region far from the centers 
/Lt l5 /i 2 , • ■ • , so as to make the sampling challenging. The MH proposal is 
taken to be bivariate Gaussian: X^ ± ~ N 2 (X$ ,t?I 2 ), where the initial MH 
proposal step size Tj for the ith order chain was taken to be 0.25\/7I. 
The step size was finely tuned later in the algorithm such that the accep- 
tance ratio was in the range (0.22,0.32). After a burn-in period, each chain 
was run for 50,000 iterations. Figure 2 shows the samples generated in each 
chain: With the help of the higher-order chains, where the distributions are 
more flat, all the modes of the target distribution were successfully visited by 
X^ . The number of samples in each energy ring is reported in Table 1. One 
can see that for low-order chains the samples are mostly concentrated in the 
low-energy rings, while for high-order chains more samples are distributed 
in the high-energy rings. 

As a comparison, we also applied parallel tempering (PT) [8] to this prob- 
lem. The PT procedure also adopts a temperature ladder; it uses a swap 
between neighboring temperature chains to help the low-temperature chain 
move. We ran the PT to sample from (6) with the same parameter and ini- 
tialization setting. The step size of PT was tuned to make the acceptance 
ratio of the MH move between 0.22 and 0.32. The exchange (swap) probabil- 
ity of PT was taken to be 0.1 to make it comparable with p cc = 0.1 in the EE 
sampler, and in each PT exchange operation, K = 4 swaps were proposed to 
exchange samples in neighboring chains. The overall acceptance rates for the 
MH move in the EE sampler and parallel tempering were 0.27 and 0.29, re- 
spectively. In the EE sampler, the acceptance rate for the equi-energy jump 
was 0.82, while the acceptance rate for the exchange operation in PT was 
0.59. Figure 3(a) shows the path of the last 2000 samples in X^ for the EE 
sampler, which visited all the 20 components frequently; in comparison PT 
only visited 14 components in the same number of samples [Figure 3(b)]. As 

Table 1 
Sample size of each energy ring 



Energy rings 



Chain 




< 2.0 


[2.0,6.3) 


[6.3,20.0) 


[20.0,63.2) 


> 63.2 


X<°\To 


= 1 


41631 


8229 


140 










= 2.8 


21118 


23035 


5797 


50 





X (2) ,T 2 


= 7.7 


7686 


16285 


22095 


3914 


20 


X (3) ,T 3 


= 21.6 


3055 


6470 


17841 


20597 


2037 




= 60.0 


1300 


2956 


8638 


20992 


16114 
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Fig. 2. Samples generated from each chain of the EE sampler. The chains are sorted in 
ascending order of temperature and energy truncation. 
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a further test, we ran the EE sampler and the PT 20 times independently, 
and sought to estimate the mean vector (EXi, EX2) and the second moment 
(EXfjEX^) using the samples generated from the target chain X^- ' . The 
results are shown in Table 2 (the upper half ) . It is clear that the EE sampler 
provided more accurate estimates with smaller mean squared errors. 

To compare the mixing speed for the two sampling algorithms, we counted 
in each of the 20 runs how many times the samples visited each mode in the 
last 2000 iterations. We then calculated the absolute frequency error for each 
mode, errj = \fi — 0.05|, where fi is the sample frequency of the ith mode 
(i = l,..., 20) being visited. For each mode i, we calculated the median and 
the maximum of errj over the 20 runs. Table 3 reports, for each mode, the 
ratio Rl of the median frequency error of PT over that of EE, and the ratio 
R2 of the maximum frequency error of PT over the corresponding value of 
EE. A two- to fourfold improvement by EE was observed. We also noted that 
EE did not miss a single mode in all runs, whereas PT missed some modes 
in each run. Table 3 also shows, out of the 20 runs, how many times each 
mode was missed by PT; for example, mode 1 was missed twice by PT over 
the 20 runs. The better global exploration ability of the EE sampler is thus 
evident. To further compare the two algorithms by their convergence speed, 
we tuned the temperature ladder for PT to achieve the best performance 
of PT; for example, we tuned the highest temperature T4 of PT in the 
range [5, 100]. We observe that the best performance of PT such that it is 
not trapped by some local modes in 50,000 samples and that it achieves 
minimal sample autocorrelation is the setting associated with T4 = 10. The 
sample autocorrelations of the target chain X^ from the EE sampler and 
the optimal PT are shown in Figures 3(c) and (d), respectively; evidently 
the autocorrelation of the EE sampler decays much faster even compared 
with a well-tuned PT. 

We also use this example to study the choice of p ee . We took p ee = 0.1, 0.2, 
0.3 and 0.4, and ran the EE sampler 20 times independently for each value 
of Pee- From the average mean squared errors for estimating (EX%,EX2) 
and (EXi, EX%) we find that the EE sampler behaves well when p ee is 
between 0.1 and 0.3. When p ee is increased to 0.4, the performance of the 
EE sampler worsens. In addition, we also noticed that the performance of 
the EE sampler is not sensitive to the value of p ee , as long as it is in the 
range [0.05,0.3]. 

Next, we changed the weight and variance for each component in (6) such 
that Wi tx 1/di and of = dj/20, where di = — (5,5)'||. The distributions 
closer to (5,5) have larger weights and lower energy. We used this example 
to test our strategy of dynamically updating the energy and temperature 
ladders (Section 3.3). We set the initial energy lower bound Hq = 3, which 
is higher than the energy at any of the 20 modes (in practice, we could 
try to get a better initial value of Hq by some local optimization). The 
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Fig. 3. Mixture normal distribution with equal weights and variances. The sample path 
of the last 2000 iterations for (a) EE sampler and (b ) parallel tempering. Autocorrelation 
plots for the samples from (c) EE sampler and (d) optimal parallel tempering. 



Table 2 

Comparison of the EE sampler and PT for estimating the mixture normal distributions 

EX± EX^ EX\ EX\ 

True value 4.478 4.905 25.605 33.920 

EE 4.5019 (0.107) 4.9439 (0.139) 25.9241 (1.098) 34.4763 (1.373) 

PT 4.4185 (0.170) 4.8790 (0.283) 24.9856 (1.713) 33.5966 (2.867) 

MSE(PT)/MSE(EE) 2.7 3.8 2.6 3.8 

True value 4.688 5.030 25.558 31.378 

EE 4.699 (0.072) 5.037 (0.086) 25.693 (0.739) 31.433 (0.839) 

PT 4.709 (0.116) 5.001 (0.134) 25.813 (1.122) 31.105 (1.186) 

MSE(PT)/MSE(EE) 2.6 2.5 2.4 2.1 

The numbers in parentheses are the standard deviations from 20 independent runs. The 
upper and bottom halves correspond to equal and unequal weights and variances, respec- 
tively. 
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Table 3 

Comparison of mixing speed of the EE sampler and PT for the mixture normal 

distribution 





Mi 


M2 


Ma 


M4 


Ms 


Me 


Mr 


Ms 


M9 


Mio 


Rl 


2.8 


2.9 


3.2 


3.6 


2.7 


4.1 


1.7 


2.1 


2.7 


3.3 


R2 


1.6 


4.5 


2.5 


4.9 


2.5 


2.1 


2.2 


2.3 


2.3 


2.5 


PT m is 


2 


3 


1 


5 


4 





2 


2 


3 


1 




Mn 


Ml2 


Ml3 


Ml4 


Mis 


Ml6 


Ml7 


Mis 


Ml9 


M20 


Rl 


1.3 


2.5 


2.0 


4.4 


2.2 


2.5 


2.0 


1.9 


2.6 


1.4 


R2 


3.8 


1.2 


1.9 


2.1 


3.2 


3.0 


3.1 


3.0 


3.9 


2.0 


PT mis 


1 


3 


1 


1 


1 


3 


4 


6 


1 


3 



721 is the ratio of median frequency error of PT over that of EE; R2 is the ratio of maximum 
frequency error of PT over that of EE. PT m j s reports the number of runs for which PT 
missed the individual modes. The EE sampler did not miss any of the 20 modes. All the 
statistics are calculated from the last 2000 samples in 20 independent runs. 



highest energy level and temperature were set at 100 and 20, respectively. 
We started with five chains and dynamically added more chains if necessary. 
The strategy for automatically updating the proposal step size was applied 
as well. After drawing the samples we also calculated the first two sample 
moments from the target chain as simple estimates for the theoretical 
moments. The mean and standard deviation of the estimates based on 20 
independent EE runs, each consisting of 10,000 iterations, are shown in Table 
2 (the bottom half). The sample path for the last 1000 iterations and the 
autocorrelation plots are shown in Figure 4. 

For comparison, PT was applied to this unequal weight case as well. PT 
used the same total number of chains as the EE sampler. The MH step 
size of PT was tuned to achieve the same acceptance rate. The temperature 
ladder of PT was also tuned so that the exchange operator in PT had the 
same acceptance rate as the equi-energy jump in EE, similarly to what we 
did in the previous comparison. With these well-tuned parameters, we ran 
PT for the same number of iterations and calculated the first two sample 
moments as we did for the EE samples. The results are reported in Table 2. 
It is seen that the EE sampler with the self-adaptation strategies provided 
more precise estimates (both smaller bias and smaller variance) in all the 
cases. Similar improvements in mixing speed and autocorrelation decay were 
also observed (Figure 4). 

4. Calculating the density of states and the Boltzmann averages. The 

previous section illustrated the benefit of constructing the energy rings: It 
allows more efficient sampling through the equi-energy jump. By enabling 
one to look at the states within a given energy range, the energy rings also 
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Fig. 4. Mixture normal distribution with unequal weights and variances. Autocorrelation 
plots for the samples from (a,) EE sampler and (b ) parallel tempering. The sample path 
for the last 1000 iterations for (c) EE sampler and (A) parallel tempering. 

provide a direct means to study the microscopic structure of the state space 
on an energy- by-energy basis, that is, the microcanonical distribution. The 
EE sampler and the energy rings are thus well suited to study problems in 
statistical mechanics. 

Starting from the Boltzmann distribution (1) one wants to study vari- 
ous aspects of the system. The density of states Q.(u), whose logarithm is 
referred to as the microcanonical entropy, plays an important role in the 
study, because in addition to the temperature-energy duality depicted in 
Section 2, many thermodynamic quantities can be directly calculated from 
the density of states, for example, the partition function Z(T), the inter- 
nal energy U(T), the specific heat C(T) and the free energy F{T) can be 
calculated via 



Z{T) = / n{u)e- u/T du, 
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fuSl(u)e-"/ T du 



J Q(u)e~ u / T du ' 

j u 2 VL(u)e~ u l T du ( J u£l{u)e- u ' T du\ 2 ' 



_ dU(T) 1 
C[I) - dT ~T* 

F(T) = -T\og(Z(T)). 



\ u z ll(u)e- u l 1 du ( J uil{u)e- U ^ du\' 
J n(u)e- u / T du \ fn(u)e- u / T du J 



Since the construction of the energy rings is an integral part of the EE 
sampler, it leads to a simple way to estimate the density of states. Suppose 
we have a discrete system to study, and after performing the EE sampling 
on the distributions 

iTi(x) ocexp(-hi(x)), hi(x) = —(h(x)V Hi), < i < K, 

-Li 

we obtain the energy rings Dj (0 <i,j< K), each Dj corresponding to 
an a priori energy range [Hj,Hj + i). To calculate the density of states for 
the discrete system we can further divide the energy rings into subsets such 
that each subset corresponds to one energy level. Let mi u denote the to- 
tal number of samples in the ith chain that have energy u. Clearly 
J2 u e[H ■ ,H j+1 ) m iu = l-^j^l- Under the distribution 7Tj 

Sl(u)e-( uWHi )/ Ti 

(7) P^h(X)=u)- 



Z v n(v)e-( v vHi)/Ti- 



Since the density of states tt(u) is common for each 7Tj, we can combine 
the sample chains < % < K, to estimate Q(u). Denote rrtj. = J2u m iu, 
m»u = J2i m iu an d o,i u = e -( uVH i)/ T i for notational ease. Pretending we have 
independent multinomial observations, 



i \ u- ■ \( Sl{u)ai 

{. . . , rrii U , ■ ■ ■) ~ multinomial I m^.; 



(8) 

the MLE of n(u) is then 
(9) 



0,1,. ..,K, 



Cl = argmax j^m. u log(Q(u)) -^m*. log tt(v)ai^J j. 

Since Q(u) is specified up to a scale change [see (7)], to estimate the relative 
value we can without loss of generality set £1(uq) = 1 for some no- The first- 
order condition of (9) gives 

(10) ^ - V mi ' aiu = for all u, 
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which can be used to compute £l(u) through a simple iteration, 



(ii) n(u) = m. u /J2 s 

A careful reader might question the independent multinomial assumption 
(9). But it is only used to motivate (10), which itself can be viewed as a 
moment equation and is valid irrespective of the multinomial assumption. 

With the density of states estimated, suppose one wants to investigate 

, t-i 1, 7-1/ / m\ ^ 9( x ) ey °p( — h'{x)/T) 

how the Boltzmann average E{g{X);l) = ^ exp ( h(x)/T) — varies as a 
function of temperature T (e.g., phase transition). Then we can write (see 
Lemma 1) 



£„S!(u)e-/T ' 

To estimate the microcanonical average v g {u) = E(g(X)\h(X) = u), we can 
simply calculate the sample average over the energy slice {x : h{x) = u} for 
each chain X^ , . . . , X ^ and combine them using weights proportional to 
the energy slice sample size for i = 0,1, . . . , K . 

So far we have focused on discrete systems. In continuous systems to esti- 
mate Q(u) and v g {u) = E(g(X)\h(X) = u) we can simply discretize the en- 
ergy space with an acceptable resolution and follow the preceding approach 
to use the EE sampler and the energy rings. 

To illustrate our method to calculate the density of states and the Boltzmann 
averages, let us consider a multidimensional normal distribution /(x;T) = 
■T^yexpf— /i(x)/T] with temperature T and the energy function /i(x) = 

\Y^i x l- This corresponds to a system of n uncoupled harmonic oscilla- 
tors. Since the analytical result is known in this case, we can check our 
numerical calculation with the exact values. Suppose we are interested in 
estimating E(Xl;T) and the ratio of normalizing constants Z(T)/Z{1) for 
T in [1, 5]. The theoretical density of states (from the x 2 result in this case) is 
O(n) oc u™/ 2 " 1 and the microcanonical average v g (u) = E(Xf\h(X.) = u) = 
2u/n. Our goal is to estimate Sl{u) and v g {u) and then combine them to 
estimate both E(Xf;T) and Z(T)/Z(l) as functions of the temperature T. 

We took n = 4 and applied the EE sampler with five chains to sam- 
ple from the four-dimensional distribution. The energy levels Hq, Hi, . . . , H4 
were assigned by a geometric progression along the interval [0,50]. The tem- 
peratures were accordingly set between 1 and 20. The equi-energy jump 
probability p cc was taken to be 0.05. Each chain was run for 150,000 itera- 
tions (the first 50,000 iterations were the burn-in period); the sampling was 
repeated ten times to calculate the standard error of our estimates. Since 
the underlying distribution is continuous, to estimate O(w) and v g (u) each 
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energy ring was further evenly divided into 20 small intervals. The estimates 
f)(it) and g (u) were calculated from the recursion (11) and the combined 
energy slice sample average, respectively. Figure 5(a) and (b) shows Q(u) 
and Vg(u) compared with their theoretical values. Using Q(u) and v g (u), we 
estimated E(Xf;T) and Z(T)/Z(1) by 



E(Xl-T) 



E u O g (uMu)e- u / T At 



J2 u n{u)e- u / T Au 
Z(T) _ J2 u n(u)e~ u / T Au 




Fig. 5. The density of states calculation for the four- dimensional normal distribution, 
(a) Estimated density of states fi(u); (b) estimated micro- canonical average v{u); (c) 
estimated E(X?;T) and (d) estimated Z(T)/Z(1) for T varying from 1 to 5. The error 
bars represent plus or minus twice the standard deviation from ten independent runs. 
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Letting T vary in [1, 5] results in the estimated curves in Figure 5(c) and (d). 
One can see from the figure that our estimations are very precise compared 
to the theoretical values. In addition, our method has the advantage that 
we are able to construct estimates for a wide temperature range using one 
simulation that involves only five temperatures. 

We further tested our method on a four-dimensional normal mixture dis- 
tribution with the energy function 

(12) h(x) = -log[exp(-||x - /ij 2 ) + 0.25exp(-||x - /i 2 || 2 )], 

where /i 1 = (3, 0, 0, 0)' and /i 2 = (—3, 0, 0, 0)'. For the Boltzmann distribution 
^T^j exp(— /i(x)/T), we are interested in estimating the probability P(X\ > 
0; T) and studying how it varies as T changes. It is easy to see if T = 1 this 
probability will be 0.8 = 1/1.25, the mixture proportion, and it decreases as 
T becomes larger. We applied the EE sampler to this problem with the same 
parameter setting as in the preceding 4D normal example. Using the energy 
rings, we calculated the estimates Q(u) and v g {u) and then combined them 
to estimate P(X\ > 0;T). Figure 6 plots the estimates. It is interesting to 
note from the figure that the density of states for the mixture model has a 
change point at energy u = 1.4. This is due to the fact that the energy at the 
mode is about 1.4(~ — log 0.25), and hence for u < 1.4 all the samples are 
from the first mode and for u > 1.4 the samples come from both modes, 
whence a change point appears. A similar phenomenon occurs in the plot 
for u g {u) = P{Xi > 0|/i(X) = u). The combined estimate of P(X X > 0;T) 
in Figure 6 was checked and agreed very well with the exact values from 
numerical integration. 

5. Statistical estimation with the EE sampler. In statistical inference, 
usually after obtaining the Monte Carlo samples the next goal is to estimate 
some statistical quantities. Unlike statistical mechanical considerations, in 
statistical inference problems often one is only interested in one target dis- 
tribution (i.e., only one temperature). Suppose the expected value E no g(X) 
under the target distribution ttq = tt is of interest. A simple estimate is the 
sample mean under the chain X^ (T = 1). This, however, does not use 
the samples optimally in that it essentially throws away all the other sam- 
pling chains X^, . . . ,1^. With the help of the energy rings, in fact all 
the chains can be combined together to provide a more efficient estimate. 
One way is to use the energy-temperature duality as discussed above. Here 
we present an alternative, more direct method: We directly work with the 
(finite number of) expectations within each energy set. For a continuous 
problem, this allows us to avoid dealing with infinitesimal energy intervals, 
which are needed in the calculation of micro canonical averages. 
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Fig. 6. Four- dimensional normal mixture distribution, (a) Estimated density of states 
fl(u); (h) estimated microcanonical average 0{u); (c) detailed plot for Cl{u) around 
the change point; (A) detailed plot for 0(u) around the change point; (e) estimated 
P(Xi > 0;T) for T in [1. 5]. For comparison, the theoretical Q(u) from the 4D normal 
example is also shown in (a) and (c). The error bars represent plus or minus twice the 
standard deviation from ten independent runs. 
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The starting point is the identity 

K 

(13) E no g(X) = J2PjEn {g(X)\X G Dj}, 

j=0 

where pj = P no (X G Dj ) , which suggests that we can first estimate pj and 
Gj = E nQ {g(X)\X G Dj}, and then conjoin them. A naive estimate of Gj is 

the sample average of g(x) within the energy ring bf\ But a better way is 

* (i) 

to use the energy rings Dj for i = 0, 1, . . . , K together, because for each i 
the importance-weighted 



£ Vcf)(l ) 5 (A>W(X) 



G 



(i) _ ^ x * D i 

£ (X) 

i 



where w^(x) = exp{hi(x) — h(x)}, is a consistent estimate of Gj. We can 
thus put proper weights on them to combine them. 

Since it is quite often that a number of different estimands g are of interest, 

- (i) 

a conceptually simple and effective method is to weig ht Gy, i = 0,l,...,K, 
proportionally to their effective sample sizes [15], 

\D {l) \ 

> l + Y^ Vi {w^{X)\X G D j }/{E 1Ti {w^{X)\X G Dj}) 2 ' 

which measures the effectiveness of a given importance-sampling scheme, 
and does not involve the specific estimand. ESS^ can be easily estimated 
by using the sample mean and variance of (X) for X in the energy ring 

bf. 

To estimate the energy-ring probability pj = P no (X G Dj) , one can use 
Le sample size proportioi 
chains, because for any i, 



the sample size proportion K 3 (0) . But again it is better to use all the 



if= X£ °; . , where m=\jtf 

is a consistent estimate of pj . To combine them properly, we can weight them 
inversely proportional to their variances. A simple delta method calculation 
(disregarding the dependence) gives the asymptotic variance of pj , 

v {i) _ 1 E nt [(I(X€D 3 )-p 3 )w^(X)] 2 



m [E^w^{x)f 



EQUI-ENERGY SAMPLER 



25 



where rii is the total number of samples under the ith chain X® . Vj can 
be estimated by 



where pj is a consistent estimate of pj . Since Vj requires an a priori pj , a 
simple iterative procedure can be applied to obtain a good estimate of pj, 

starting from the naive Pj ■ In practice to ensure the numerical stability 

of the variance estimate Vj it is recommended that pj- be included in 

the combined estimate only if the sample size \Dj \ is reasonably large, say 
more than 50. When we get our estimates for pj, we need to normalize them 
before plugging in (13) to calculate our combined estimates. 

Since our energy-ring estimate of E 7T0 g(X) employs all the sampling chains, 
one expects it to increase the estimation efficiency by a large margin com- 
pared with the naive method. 

The two-dimensional normal mixture model (continued). To illustrate 
our estimation strategy, consider again the 2D mixture model (6) in Sec- 
tion 3.4 with equal weights and variances a\ = ■ ■ ■ = 020 = 0.1, w\ = ■ ■ ■ = 
W20 = 0.05. 

Suppose we want to estimate the functions EXf, EX%, Ee~ 10Xl and 
Ee~ 10X2 and the tail probabilities 



p x = P(X 1 > 8.41, X 2 < 1.68 and yf(Xi - 8.41) 2 + (X 2 - 1.68) 2 > 4a), 

Table 4 

Comparison of the energy-ring estimates with the naive estimates that use X (0) only 



True value 


EX\ 


EXl 


Ee- 10X i 


Ee -iox 2 


Pi 


Pi 


25.605 


33.920 


9.3e-7 


0.0378 


4.2e-6 


6.7e-5 


Energy ring 


25.8968 


34.2902 


8.8e-7 


0.0379 


4.5e-6 


6.4e-5 


estimates 


(0.9153) 


(1.1579) 


(1.2e-7) 


(0.0044) 


(1.5e-6) 


(2.0e-5) 


Naive 


25.9241 


34.4763 


8.7e-7 


0.0380 


1.0e-5 


7.3e-5 


estimates 


(1.0982) 


(1.3733) 


(1.5e-7) 


(0.0052) 


(2.5e-5) 


(6.2e-5) 


MSER 


71% 


67% 


57% 


72% 


0.34% 


11% 



The numbers in parentheses are the standard deviations from 20 independent runs. MSER 
is defined as MSE\j MSE2, where MSE\ and MSE2 are the mean squared errors of the 
energy-ring estimates and the naive estimates, respectively. 
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p 2 = P(Xf+X 2 2 >175). 

After obtaining the EE samples (as described in Section 3.4), we calculated 
our energy-ring estimates. For comparison, we also calculated the naive esti- 
mates based on the target chain only. The calculation was repeated for 
the 20 independent runs of the EE sampler under the same parameter set- 
tings. Table 4 reports the result. Evidently, the energy-ring estimates with 
both smaller bias and smaller variance are much more precise than the naive 
estimates. The mean squared error has been reduced by at least 28% in all 
cases. The improvement over the naive method is particularly dramatic in 
the tail probability estimation, where the MSEs experienced a more than 
ninefold reduction. 



6. Applications of the EE sampler. We will apply the EE sampler and 
the estimation strategy to a variety of problems in this section to illustrate 
its effectiveness. The first example is a mixture regression problem. The 
second example involves motif discovery in computational biology. In the 
third one we study the thermodynamic property of a protein folding model. 

6.1. Mixture exponential regression. Suppose we observe data pairs 
(Y,X) = {(yi,xi), . . . , (y n ,x n )} from the mixture model 



(14) 



J Exp[#i(xj)], with probability a, 
\ Exp[#2(xj)], with probability I — a, 



where #j(xj) = exp[/3jxj] (J = 1,2), Exp(#) denotes an exponential distri- 
bution with mean 9 and (3 l , f3 2 arm x « (i = 1,2, . . . ,n) are p-dimensional 
vectors. Given the covariates Xj and the response variable yi, one wants to 
infer the regression coefficients /3 1 and /3 2 and the mixture probability a. 
The likelihood of the observed data is 

P(Y,X\a, 13^/32) 

a ( in \ .1 - a 



i=l 



exp — — — - + — — - exp 



0i(xo flic*); o 2 ( Xl ) 2 (x. 



If we put a Beta(l, 1) prior on a, and a multivariate normal N(0, <r 2 I) on f3j 
(j = 1,2), the energy function, defined as the negative log-posterior density, 
is 

h(a,(3i,f3 2 ) = -logP(a,/3 1 ,/3 2 |Y,X) 

(16) 

^ 2 p 

= -logP(Y,X\a,(3 1 ,(3 2 ) + ^ Y.Y.P%+ C > 

k=ij=i 
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where C is a constant. Since h(a, /3 2 ) = — a,/3 2 ,/3i) (i-e., noniden- 
tifiable), the posterior distribution has multiple modes in the (2p + 1)- 
dimensional parameter space. This example thus serves as a good model 
to test the performance of the EE sampler in high-dimensional multimodal 
problems. 

We simulated 200 data pairs with the following parameter setting: a = 0.3, 
(3 1 = (1,2)', j3 2 = (4,5)' and x» = (l,Uj)', with the u^s independently drawn 
from Unif(0,2). We took a 2 = 100 in the prior normal distributions for the 
regression coefficients. After a local minimization of the energy function (16) 
from several randomly chosen states in the parameter space, the minimum 
energy H m \ a is found to be around H m \ n « —1740.8 (this value is not crucial 
in the EE sampler, since it can adaptively adjust the energy and temper- 
ature ladder; see Section 3.3). We then applied the EE sampler with eight 
chains to sample from the posterior distribution. The energy ladder was set 
between H m \ n and H m \ n + 100 in a geometric progression, and the temper- 
atures were between 1 and 30. The equi-energy jump probability p ee was 
taken to be 0.1. Each chain was run for 15,000 iterations with the first 5000 
iterations serving as a burn-in period. The overall acceptance rates for the 
MH move and the equi-energy jump were 0.27 and 0.80, respectively. Figure 
7(a) to (c) shows the sample marginal posterior distributions of a, fii and 
(3 2 from the target chain X^ ' . It is clear that the EE sampler visited the 
two modes, equally frequently in spite of the long distance between the two 
modes, and the samples around each mode were centered at the true pa- 
rameters. Furthermore, since the posterior distribution for this problem has 
two symmetric modes in the parameter space due to the nonidentifiability, 
by visiting the two modes equally frequently the EE sampler demonstrates 
its capability of global exploration (as opposed to being trapped by a local 
mode). 

For comparison, we also applied PT with eight chains to this problem 
under the same parameter setting, where the acceptance rates for the MH 
move and the exchange operator were 0.23 and 0.53, respectively. To com- 
pare their ability to escape local traps, we calculated the frequency with 
which the samples stayed at one particular mode in the lowest temperature 
chain (To = 1). This frequency was 0.55 for the EE samplers and 0.89 for 
PT, indicating that the EE sampler visited the two modes symmetrically, 
while PT tended to be trapped at one mode for a long time. We further 
tuned the temperature ladder for PT with the highest temperature varying 
from 10 to 50. For each value of the highest temperature, the temperature 
ladder was set to decrease with a geometric rate. We observed that the sam- 
ple autocorrelations decrease with the decrease of the highest temperature, 
that is, the denser the temperature ladder, the smaller the sample autocor- 
relation. However, the tradeoff is that with lower temperature, PT tends to 
get trapped in one local mode. For instance, PT was totally trapped to one 
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Fig. 7. Statistical inference for the mixture exponential regression model. Marginal pos- 
terior distribution for ( a,) a; (b) f3\\; and (c) /3i2 (the marginal distributions for (3 2 are 
similar to those for f3 1 and thus are not plotted here). Autocorrelation plot for the samples 
from (A) EE sampler; (e) PT with T 7 = 30; and (i ) PT with T 7 = 10. 

particular mode if we set the highest temperature to be 10 although with 
this temperature ladder PT showed autocorrelation comparable with that 
of EE. On the other hand, if we increased the temperatures, PT would be 
able to jump between the two modes, but the autocorrelation also increased 
since the exchange rates became lower. But even with the highest temper- 
ature raised to 50, 80% of the PT samples were trapped to one mode and 
the posterior distributions were severely asymmetric. The autocorrelation 
plots for Pn are compared in Figure 7(d) and (e), where one sees that au- 
tocorrelation of the EE sampler decays faster than that of PT. We also plot 
the autocorrelation for PT with highest temperature Tj = 10 in Figure 7(f), 
which corresponded to the minimal autocorrelation reached by PT among 
our tuning values, but the local trapping of PT with this parameter setting 
is severely pronounced. 

6.2. Motif sampling in biological sequences. A central problem in biology 
is to understand how gene expression is regulated in different cellular pro- 
cesses. One important means for gene regulation is through the interaction 
between transcription factors (TF's) and their binding sites (viz., the sites 
that the TF recognizes and binds to in the regulatory regions of the gene). 
The common pattern of the recognition sites of a TF is called a binding mo- 
tif, whose identification is the first step for understanding gene regulation. It 
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is very time-consuming to determine these binding sites experimentally, and 
thus computational methods have been developed in the past two decades 
for discovering novel motif patterns and TF binding sites. 

Some early methods based on site consensus used a progressive alignment 
procedure [36] to find motifs. A formal statistical model for the position- 
specific weight matrix (PWM)-based method was described in [21] and a 
complete Bayesian method was given in [25]. Based on a missing data formu- 
lation, the EM algorithm [1, 21] and the Gibbs sampler [20] were employed 
for motif discovery. The model has been generalized to discover modules of 
several clustered motifs simultaneously via data augmentation with dynamic 
programming [41]. See [14] for a recent review. 

From a sampling point of view motif discovery poses a great challenge, 
because it is essentially a combinatorial problem, which makes the samplers 
very vulnerable to being trapped in numerous local modes. We apply the 
EE sampler to this problem under a complete Bayesian formulation to test 
its capability. 

6.2.1. Bayesian formulation of motif sampling. The goal of motif dis- 
covery is to find the binding sites of a common TF (i.e., positions in the 
sequences that correspond to a common pattern). Let S denote a set of 
M (promoter region) sequences, each sequence being a string of four nu- 
cleotides, A, C, G or T. The lengths of the sequences are L±,L2, and 
Lm- For notational ease, let A = {Aij,i = 1,2,..., M, j = 1,2,..., Li} be the 
indicator array such that A^ = 1 if the jth position on the ith sequence is 
the starting point for a motif site and = otherwise. In the motif sam- 
pling problem, a first-order Markov chain is used to model the background 
sequences; its parameters 6q (i.e., the transition probabilities) are estimated 
from the entire sequence data prior to the motif search. We thus effectively 
assume that 6q is known a priori. Given A, we further denote the aligned 
motif sites by S(A), the nonsite background sequences by S(A C ) and the 
total number of motif sites by |A|. The motif width w is treated as known. 
The common pattern of the motif is modeled by a product multinomial dis- 
tribution © = (61,62, . . . ,6 W ), where each 6{ is a probability vector of length 
4 for the preferences of the nucleotides (A, C, G,T) in the motif column i. 
See Figure 8 for an illustration of the motif model. 

To write down the joint distribution of the complete data and all the 
parameters, it is assumed a priori that a randomly selected segment of width 
w has probability po to be a motif site (po is called the "site abundance" 
parameter). The joint distribution function has the form 



P(S, A, e, P0 ) = P(S\A, @)P(A\p )7T(&)7T(p ) 



(17) 
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1 uu 

° C P(S(A)|A,fl )ii * 

where we put a product Dirichlet prior 7r(0) with parameter . . . , /3 W ) on 
(each /3j is a length-4 vector corresponding to A, C, G, T), and a Beta(a, 6) 
prior 7r(po) on po- In (17), = 1,2, . . . ,io) is the count vector for the ith 
position of the motif sites S(A) [e.g., ci = {c\a,c\c-, c\g,c\t) counts the total 
number of A, C, G, T in the first position of the motif in all the sequences] , 

the notation 6>f !+ft) = Uj 0$ i+Pii) (j = A, C, G, T), and L = E*=i U. Since 
the main goal is to find the motif binding sites given the sequence, P(A|S) 
is of primary interest. The parameters (0 and po) can thus be integrated 
out resulting in the following "collapsed" posterior distribution [14, 24]: 

P(A|S) oc 



P(S(A)|A,0 o ) 
(18) 

r(|A| + a)r(L-|A| + &) " r(c t + /3 t ) 

T(L + a + b) llrdAl + l/3,1)' 

where r(c, + ft) = n, r(cy + fa) (j = A, C, G, T) and |ft| = J2j Pij- 

6.2.2. The equi-energy motif sampler. Our target is to sample from P(A|S) 
The EE sampler starts from a set of modified distributions, 

/,(A)ocexpf- MA l V ^ Y i = 0,l,...,K, 



— 'TTTGC — 
— TAICC — 
— CTTGC — 

-mu- 

— GTTGC — 

Fig. 8. (a.) The sequences are viewed as a mixture of motif sites (the rectangles) and 
background letters, (b) Motif model is represented by PWM, or equivalently, a product 
multinomial distribution, where we assume that the columns within a motif are indepen- 
dent. 
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where h(A) = — logP(A|S) is the energy function for this problem. At the 
target chain (i = 0, To = 1), we simply implement the Gibbs sampler, that 
is, sequentially sample the indicator array A one position at a time with the 
rest of the positions fixed. To update other sampling chains at i = 1, 2, . . . , K, 
given the current sample A we first estimate the motif pattern by a 
simple frequency counting with some extra pseudocounts, where the number 
of pseudocounts increases linearly with the chain index so that at higher- 
order chains is stretched more toward a uniform weight matrix. Next, 
we fix fio = 1/L, where L is the average sequence length. Given and po, 
we sample each position in the sequences independently to obtain a new 
indicator array A* according to the Bayes rule: Suppose the nucleotides at 
positions j to j + w — 1 in sequence k are x±x<2 • ■ -x w . We propose A* k - = 1 
with probability 

Mn=l®n Xn 

Ikj — ' i 

PO Iln=l ®nx n + (1 - P0)P{xi " ' ' X W \Q Q ) 

where P(x\ ■ ■ -x^Oo) denotes the probability of generating these nucleotides 
from the background model. Then A* is accepted to be the new indicator 
array according to the Metropolis-Hastings ratio 

(19) : — fi(A*) P(A|0*) 



/i(A) P(A*|0) 

where P(A*|0) = Ylkj qkj denotes the proposal probability of generating the 
sample A* given current A. In addition to the above updating, the EE motif 
sampler performs the equi-energy jump in each iteration with probability p cc 
to help the sampler move freely between different local modes. 



6.2.3. Sampling motifs in "low complexity" sequences. In the genomes 
of higher organisms, the presence of long stretches of simple repeats, such as 
AAAA. . . or CGCGCG . . . often makes the motif discovery more difficult, 
because these repeated patterns are local traps for the algorithms — even the 
most popular motif finding algorithms based on the Gibbs sampler, such 
as BioProspector [26] and AlignACE [33], are often trapped to the repeats 
and miss the true motif pattern. To test whether the EE motif sampler is 
capable of finding the motifs surrounded by simple repeats, we constructed 
a set of sequences with the following transition matrix for the background 
model: 



(20) 6>o = 



— 3a 
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1 — 3a 


a a 
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1 — 3a a 
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a 1 — 3a 
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where a = 0.12. The data set contained ten sequences, each of length 200 
base pairs (i.e., L\ = Li = ■ ■ ■ = L\q = 200). Then we independently gener- 
ated 20 motif sites from the weight matrix whose logo plot [34] is shown in 
Figure 9(a) and inserted them into the sequences randomly. 

The EE motif sampler was applied to this data set with w = 12. To set 
up the energy and temperature ladder, we randomly picked 15 segments 
of length w in the sequences, treated them as motif sites, and used the 
corresponding energy value as the upper bound of the energy Hr. For the 
lower bound Hq, a rough value was estimated by calculating the energy 
function (18) for typical motif strength with a reasonable number of true 
sites. Note that since the EE sampler can adaptively adjust the energy and 
temperature ladder (see Section 3.3), the bound Hq does not need to be 
very precise. After some trials Hq was set to be —50. We utilized K + 1 = 5 
chains. The energy ladder Hj was set by a geometric progression between 
Hq and Hk- The temperatures were set by (-ffj+i — Hj)/Tj = 5. The equi- 
energy jump probability p ee was taken to be 0.1. We ran each chain for 
1000 iterations. Our algorithm predicted 18.4 true sites (out of 20) with 
1.0 false site on average over ten independent simulations — the EE sampler 
successfully found the true motif pattern to a large extent. The sequence 
logo plot of the predicted sites is shown in Figure 9(b), which is very close 
to the true pattern in Figure 9(a) that generates the motif sites. 

The performance of the EE motif sampler was compared with that of 
BioProspector and AlignACE, where we set the maximum number of motifs 
to be detected to 3 and each algorithm was repeated ten times for the data 
set. The motif width was set to be w = 12. Both algorithms, however, tend 
to be trapped in some local modes and output repeats like those shown 
in Figures 9(c) and (d). One sees from this simple example that the EE 
sampler, capable of escaping from the numerous local modes, could increase 
the global searching ability of the sampling algorithm. 

6.2.4. Sampling mixture motif patterns. If we suspect there are multiple 
distinct motif patterns in the same set of sequences, one strategy is to intro- 
duce more motif matrices, one for each motif type [25]. Alternatively, if we 
view the different motif patterns as distinct local modes in the sample space, 
our task is then to design a motif sampler that frequently switches between 
different modes. This task is almost impossible for the Gibbs sampler, since 
it can easily get stuck in one local mode (one motif pattern) and have no 
chance to jump to other patterns in practice. We thus test if the EE sampler 
can achieve this goal. 

In our simulation, we generated 20 sites from each of two different motif 
models with logo plots in Figures 10(a) and (b) and inserted them randomly 
into 20 generated background sequences, each of length 100. We thus have 
40 motif sites in 20 sequences. The energy ladder in the EE sampler was set 
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Fig. 9. Sequence logo plots, (a.) The motif pattern that generates the simulated sites, 
(b ) The pattern of the predicted sites by the EE motif sampler, (c) and (d) Repetitive 
patterns found by BioProspector and AlignACE. 



by a geometric progression in the range [0, 110] (this is obtained similarly to 
the previous example). The equi-energy jump probability p e e was set to 0.1. 
The EE motif sampler used 10 chains; each had 1000 iterations. We recorded 
the frequency of each position in the sequences being the start of a motif site. 
Figure 10(c) shows the frequency for the starting positions of the 40 true 
motif sites (i.e., the probability that each individual motif site was visited 
by the EE motif sampler). It can be seen that the EE sampler visited both 
motif patterns with a ratio of 0.4:0.6; by contrast, the Gibbs motif sampler 
can only visit one pattern in a single run. We further sorted all the positions 
according to their frequencies of being a motif-site-start in descending order. 
For any q between and 1, one could accept any position with frequency 
greater than q as a predicted motif site of the sampler. Thus by decreasing 
q from 1 to 0, the numbers of both true and false discovered sites increase, 
which gives the so-called ROC curve (receiver operating characteristic curve) 
that plots the number of false positive sites versus the number of true positive 
sites as q varies. Figure 10(d) shows the ROC curves for both the EE motif 
sampler and the Gibbs motif sampler. 

It can be seen from the figure that for the first 20 true sites, the two 
samplers showed roughly the same performance. But since the Gibbs sampler 
missed one mode, the false positive error rate increased dramatically when 
we further decreased q to include more true sites. By being able to visit 
both modes (motif patterns), the EE motif sampler, on the other hand, had 
a very small number of false positive sites until we included 38 true sites, 
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Fig. 10. fa,) and ('bj T«jo different motifs inserted in the sequences, (c) The marginal 
posterior distribution of the 40 true sites, which reports the frequencies that these true 
site-starting positions are correctly sampled during the iterations. Sites 1-20 are of one 
motif type; sites 21-40 are of the other motif type, (d) ROC curve comparison between 
the EE sampler and the Gibbs sampler. 

which illustrates that the EE motif sampler successfully distinguished both 
types of motif patterns from the background sequences. 

6.3. The HP model for protein folding. Proteins are heteropolymers of 
20 types of amino acids. For example, the primary sequence of a protein, 
cys-ile-leu-lys-glu-met-ser-ile-arg-lys, tells us that this protein is a chain of 
10 amino acids linked by strong chemical bonds (peptide bonds) in the back- 
bone. Each different amino acid has a distinct side group that confers differ- 
ent chemical properties to it. Under a normal cellular environment, the side 
groups form weak chemical bonds (mainly hydrogen bonds) with each other 
and with the backbone so that the protein can fold into a complex, three- 
dimensional conformation. Classic experiments such as the ribonuclease re- 
folding experiments [35] suggest that the three-dimensional conformations 
of most proteins are determined by their primary sequences. The problem 
of computing the 3D conformation from a primary sequence, known as the 
protein folding problem, has been a major challenge for biophysicists for over 
30 years. This problem is difficult for two reasons. First, there are uncer- 
tainties on how to capture accurately all the important physical interactions 
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such as covalent bonds, electrostatic interactions, and interaction between 
the protein and its surrounding water molecules, and so on, into a single 
energy function that can be used in minimization and simulation computa- 
tions. Furthermore, even if the energy function is not in question, we still do 
not have efficient algorithms for computing the protein conformation from 
the energy function. For these reasons, biophysicists have developed sim- 
plified protein folding models with greatly reduced complexity in both the 
conformational space and the energy function. It is hoped that the reduced 
complexity in these simplified models will allow us to deduce insights about 
the statistical mechanics of the protein folding process through extensive 
numerical computations. 

The HP model is a simplified model that is of great current interest. In this 
model, there are only two types of amino acids, namely a hydrophilic type 
(P-type) that is capable of favorable interaction with water molecules, and a 
hydrophobic type (H-type) that does not interact well with water. Thus, the 
primary sequence of the length-10 protein in the beginning of this section is 
simplified to the sequence H-H-H-P-P-H-P-H-P-P. The conformation of the 
protein chain is specified once the spatial position of each of its amino acids 
is known. In the HP model space is modeled as a regular lattice in two or 
three dimensions. Thus the conformation of a length-fc protein is a vector x = 
(xi,X2, • • • ,£jfc) where each xi is a lattice point. Of course, neighbors in the 
backbone must also be neighbors on the lattice. Figure 11 gives one possible 
conformation of the above length-10 protein in a two-dimensional lattice 
model. Just like oil molecules in water, the hydrophobic nature of the H-type 
amino acids will drive clusters together so as to minimize their exposure to 
water. Thus we give a favorable energy [i.e., an energy of £(xi,Xj) = — 1] for 
each pair of H-type amino acids that are not neighbors in the backbone but 
are lattice neighbors of each other in the conformation [see Figure 11(b)]. 
All other neighbor pairs are neutral [i.e., having an energy contribution 
£{xi,Xj) = 0; see Figure 11(a)]. The energy of the conformation is given by 

\i-j\>l 

Since its introduction by Lau and Dill [19], the HP model has been exten- 
sively studied and found to provide valuable insights [4]. We have applied 
equi-energy sampling to study the HP model in two dimensions. Here we 
present some summary results to illustrate how our method can provide 
information that is not accessible to standard MC methods, such as the rel- 
ative contribution of the entropy to the conformational distribution, and the 
phase transition from disorder to order. The detailed implementation of the 
algorithm, the full description and the physical interpretations of the results 
are available in a separate paper [16]. 
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Fig. 11. One possible conformation of the length-10 proteins in a 2D lattice. H-type and 
P-type amino acids are represented by black and gray squares, respectively. 

Table 5 presents estimated (normalized) density of states at various en- 
ergy levels for a protein of length 20: H-P-H-P-P-H-H-P-H-P-P-H-P-H-H- 
P-P-H-P-H. For this protein there are 83,770,155 possible conformations 
with energies ranging from (completed unfolded chain) to —9 (folded con- 
formations having the maximum number of nine H-to-H interactions). The 
estimated and exact relative frequencies of the energies in this collection 
of conformations are given in the table. The estimates are based on five 
independent runs each consisting of one million steps where the probabil- 
ity of proposing an equi-energy move is set to be p ec = 0.1. It is clear that 
the method yielded very accurate estimates of the density of states at each 
energy level even though we have sampled only a small proportion of the 
population of conformations. The equi-energy move is important for the suc- 
cess of the method: if we eliminate these moves, then the algorithm performs 
very poorly in estimating the entropy at the low-energy end; for example, 
the estimated density of state at E = —9 becomes (1.545 ± 4.539) x 10~ 12 , 
which is four orders of magnitude away from the exact value. 

We also use the equi-energy sampler to study phase transition from a 
disordered state, where the conformational distribution is dominated by the 
entropy term and the protein is likely to be in an unfolded state with high 
energy, to an ordered state, where the conformation is likely to be com- 
pactly folded structures with low energy. We use the "minimum box size" 
(BOXSIZE) as a parameter to measure the extent the protein has folded. 
BOXSIZE is defined as the size of the smallest possible rectangular region 
containing all the amino acid positions in the conformation. For the 20- 
length protein, Figure 12 gives a plot of estimated Boltzmann averages of 
BOXSIZE at ten different temperatures, which are available from a single 
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Table 5 

The normalized density of states estimated from the EE sampler compared with the 

actual value 



Estimated density 



Energy 


of states 


Actual value 


t-value 


-9 


(4.751 ±2.087) x 1(T 8 


4.774 x 10" 8 


-0.011 


-8 


(1.155 ±0.203) x 10~ 6 


1.146 x 10~ 6 


0.043 


-7 


(1.452 ±0.185) x 10~ 5 


1.425 x 10~ 5 


0.144 


-6 


(1.304 ±0.189) x 10~ 4 


1.237 x 10~ 4 


0.354 


-5 


(9.602 ± 1.332) x 10~ 4 


9.200 x 10~ 4 


0.302 


-4 


(6.365 ±0.627) x 10" 3 


6.183 x 10~ 3 


0.291 


-3 


(3.600 ±0.228) x 10~ 2 


3.514 x 10~ 2 


0.377 


-2 


(1.512 ±0.054) x 10 _1 


1.489 x 10 _1 


0.423 


-1 


(3.758 ± 0.044) x 10" 1 


3.779 x 10" 1 


-0.474 





(4.296 ±0.071) x lO^ 1 


4.309 x 10~ 4 


-0.181 



The i-value is denned as the difference between the estimate and the actual value divided 
by the standard deviation. 



run of the equi-energy sampler using five energy ranges. We see that there 
is a rather sharp transition from order (folded state) to disorder at the tem- 
perature range T = 0.25 to T = 1, with an inversion point around T = 0.5. 
In our energy scale, room temperature corresponds to T = 0.4 [16]. Thus at 
room temperature this length-20 protein will not always assume the min- 
imum energy conformation; rather, it still has a significant probability of 
being in high-energy, unfolded states. It is clear from this example that the 
equi-energy sampler is capable of providing estimates of many parameters 
that are important for the understanding of the protein folding problem 
from a statistical mechanics and thermodynamics perspective. 

7. Discussion. We have presented a new Monte Carlo method that is 
capable of sampling from multiple energy ranges. By using energy truncation 
and matching temperature and step sizes to the energy range, the algorithm 
can explore the energy landscape with great flexibility. Most importantly, 
the algorithm relies on a new type of move — the equi-energy jumps — to reach 
regions of the sample space that have energy similar to the current state but 
may be separated by steep energy barriers. The equi-energy jumps provide 
direct access to these regions that are not reachable with classic Metropolis 
moves. 

We have also explained how to obtain estimates of the density of states 
and the microcanonical averages from the samples generated by the equi- 
energy sampler. Because of the duality between temperature-dependent pa- 
rameters and energy-dependent parameters, the equi-energy sampler thus 
also provides estimates of expectations under any fixed temperature. 
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Fig. 12. T/ie Boltzmann average of BOXSIZE at different temperatures. 



Our method has connections with two of the most powerful Monte Carlo 
methods currently in use, namely parallel tempering and multicanonical 
sampling. In our numerical examples, equi-energy sampling is seen to be 
more efficient than parallel tempering and it provides estimates of quanti- 
ties (e.g., entropy) not accessible by parallel tempering. In this paper we 
have not provided direct numerical comparison of our method with multi- 
canonical sampling. One potential advantage over multicanonical sampling 
is that the equi-energy jumps allow us to make use of conformations gener- 
ated in the higher energy ranges to make movement across energy barriers 
efficiently. Thus equi-energy sampling "remembers" energy-favorable con- 
figurations and makes use of them in later sampling steps. By contrast, 
multicanonical sampling makes use of previous configurations only in terms 
of the estimated density of states function. In this sense our method has bet- 
ter memory than multicanonical sampling. It is our belief that equi-energy 
sampling holds promise for improved sampling efficiency as an all-purpose 
Monte Carlo method, but definitive comparison with both parallel temper- 
ing and multicanonical sampling must await future studies. 

In addition to its obvious use in statistical physics, our method will also be 
useful to statistical inference. It offers an effective means to compute pos- 
terior expectations and marginal distributions. Furthermore, the method 
provides direct estimates of conditional expectations given fixed energy lev- 
els (the microcanonical averages). Important information on the nature of 
the likelihood or the posterior distribution, such as multimodality, can be 
extracted by careful analysis of these conditional averages. For instance, in 
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Section 4 we see that the equi-energy sampler running on (12) reveals that 
there is a change point in the density of states as well as the micro canonical 
averages [see Figures 6(c) and (d)], which clearly indicates the multimodal- 
ity of the underlying distribution. We believe that the design of methods 
for inferring the properties of the likelihood surface or the posterior density, 
based on the output of the equi-energy sampler, will be a fruitful topic for 
future investigations. 
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